Experiments to investigate the utility of nearest neighbour metrics based on linguistically informed features for detecting textual plagiarism

Authors

  • Per Almquist
  • Jussi Karlgren
Abstract

Plagiarism detection is a challenge for linguistic models: most currently implemented models use simple occurrence statistics for linguistic items. In this paper we report two experiments related to plagiarism detection in which we use a model of distributional semantics and a model of sentence stylistics to assess, sentence by sentence, the likelihood that a text is partly plagiarised. The results of the comparison are displayed for visual inspection by a plagiarism assessor.

1 Plagiarism detection

Plagiarism is the act of copying or including another author's ideas, language, or writing without proper acknowledgment of the original source. Plagiarism analysis is a collective term for computer-based methods to identify plagiarism (Stein et al., 2007a). Plagiarism analysis can be performed intrinsically, where a text is examined for internal consistency to detect suspicious passages that appear to diverge from the surrounding text, or externally, where a text is inspected with respect to some known corpus to find passages with suspiciously similar content to other text. In external plagiarism detection, it is assumed that the source document d_src for a given plagiarised document d_plg can be found in a target document collection D. Typically, plagiarism detection then proceeds in three stages (Stein et al., 2007b; Potthast et al., 2010):

1. candidate selection: a set of candidate source documents D_src is retrieved from D;
2. detailed comparison: each candidate d_src from D_src is compared passage by passage with the suspicious document d_plg, and every case where a passage from d_plg appears to be similar to some passage in some d_src is noted;
3. post-processing: false hits are removed.

2 PAN workshop series

A series of workshops on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection, organised since 2007, has provided the field with a shared task and test materials in the form of gold standard text collections with manually and automatically constructed plagiarised sections marked for experimental purposes. Some of the plagiarised sections are obfuscated with word replacement, edits, and permutations. The research results from the workshops are comparable, since they are to a large extent obtained on the same materials using the same starting points and the same target measures. Example results relevant to this study (and on the whole none too surprising) are that unobfuscated plagiarism can be detected with reasonable accuracy by the top plagiarism detectors, that recall decreases slightly with increasing obfuscation, and that longer stretches of plagiarised material are easier to detect than shorter segments (Potthast et al., 2010).

Table 1: Stylometric features

Name  Description
arg   Sentence is argumentative (merely, for sure, ...)
cog   Sentence describes cognitive process (remember, think, ...)
com   Sentence is complex (average word length > 6 characters or sentence length > 25 words)
date  Sentence contains one or more date references
fin   Sentence contains a money symbol or a percentage sign
fpp   Sentence contains first person pronouns
le    Sentence refers to named entities such as a person or an organization
loc   Sentence mentions a location
neg   Sentence contains a grammatical negation
num   Sentence contains numbers
pa    Sentence contains place adverbials (inside, outdoors, ...)
pun   Sentence contains punctuation in addition to its ending punctuation
se    Sentence contains split infinitives or stranded prepositions
spp   Sentence contains second person pronouns
sub   Sentence has subordinate clauses
ta    Sentence contains time adverbials (early, presently, soon, ...)
tim   Sentence contains one or more time expressions
tpp   Sentence contains third person pronouns
uni   Sentence contains symbols representing a unit of measurement

3 Our experimental set-up

The basis of the experiment described here is to test a finer-grained analysis of plagiarised texts than previous work has used. We use a sentence-by-sentence comparison of the suspicious text (d_plg) with all sentences of each target text (d_src) in D_src using two different similarity measures: one based on overall semantic similarity, the other on specific stylometric measures. The experiment is not a full-scale evaluation of our method but is intended to test the practicability of our approach. Given that we have a suspicious text and some reasonable number of candidate source texts (obtained through some retrieval procedure), can we detect the likelihood of plagiarism in a text by inspecting the sentences of the suspicious text one by one? This paper reports a selected plot dry run of the methodology, performed over a number of sample texts. A full-scale evaluation is pending.
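
The sentence-by-sentence nearest neighbour comparison described in the set-up can be illustrated with a minimal Python sketch. It is not the authors' implementation. Two things are assumptions made purely for illustration: a bag-of-words cosine stands in for the distributional semantic model, which the text does not specify, and only four of the Table 1 features (fpp, neg, num, com) are detected, using simplified word lists. All function names are hypothetical.

```python
import re
from collections import Counter
from math import sqrt

# Simplified, assumed word lists; the paper does not give its lexicons.
FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}
NEGATIONS = {"not", "no", "never", "nor"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(sentence):
    """Lowercased alphabetic tokens (apostrophes kept for contractions)."""
    return re.findall(r"[a-z']+", sentence.lower())


def features(sentence):
    """Binary stylometric vector (fpp, neg, num, com) for one sentence."""
    toks = words(sentence)
    avg_len = sum(len(w) for w in toks) / len(toks) if toks else 0.0
    return (
        int(any(w in FIRST_PERSON for w in toks)),                    # fpp
        int(any(w in NEGATIONS or w.endswith("n't") for w in toks)),  # neg
        int(bool(re.search(r"\d", sentence))),                        # num
        int(avg_len > 6 or len(toks) > 25),                           # com
    )


def hamming(a, b):
    """Number of positions at which two feature vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def nearest_neighbours(suspicious, sources):
    """For each sentence of the suspicious text, return a pair
    (highest lexical similarity, smallest stylometric distance)
    taken over all sentences of all candidate source texts."""
    src_sents = [s for doc in sources for s in split_sentences(doc)]
    src_bags = [Counter(words(s)) for s in src_sents]
    src_feats = [features(s) for s in src_sents]
    profile = []
    for sent in split_sentences(suspicious):
        bag, feat = Counter(words(sent)), features(sent)
        best_sim = max((cosine(bag, sb) for sb in src_bags), default=0.0)
        min_dist = min((hamming(feat, sf) for sf in src_feats), default=len(feat))
        profile.append((round(best_sim, 3), min_dist))
    return profile


suspicious = "We did not see 3 birds. The weather was fine."
sources = ["We did not see 3 birds. Cats sleep a lot."]
profile = nearest_neighbours(suspicious, sources)
```

The resulting `profile` holds one pair per sentence of the suspicious text, in order, so it can be plotted along the sentence sequence for visual inspection. With only four binary features the stylometric distance is coarse; a fuller feature set along the lines of Table 1 would presumably separate sentences better.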

Similar resources

Plagiarism checker for Persian (PCP) texts using hash-based tree representative fingerprinting

With due respect to the authors' rights, plagiarism detection is one of the critical problems in the field of text mining that many researchers are interested in. This issue is considered a serious one in higher academic institutions. There exist language-independent tools which do not yield reliable results, since the special features of each language are ignored in them. Considering the paucit...

Plagiarism Detection in Programming Assignments Using Deep Features

This paper proposes a method for detecting plagiarism in source-codes using deep features. The embeddings for programs are obtained using a character-level Recurrent Neural Network (char-RNN), which is pre-trained on Linux Kernel source-code. Many popular plagiarism detection tools are based on n-gram techniques at syntactic level. However, these approaches to plagiarism detection fail to captu...

Tags Re-ranking Using Multi-level Features in Automatic Image Annotation

Automatic image annotation is a process in which computer systems automatically assign the textual tags related with visual content to a query image. In most cases, inappropriate tags generated by the users as well as the images without any tags among the challenges available in this field have a negative effect on the query's result. In this paper, a new method is presented for automatic image...

A Novel Hybrid Approach for Email Spam Detection based on Scatter Search Algorithm and K-Nearest Neighbors

Because cyberspace and Internet predominate in the life of users, in addition to business opportunities and time reductions, threats like information theft, penetration into systems, etc. are included in the field of hardware and software. Security is the top priority to prevent a cyber-attack that users should initially be detecting the type of attacks because virtual environments are not moni...

The Effect of Variations in Integrated Writing Tasks and Proficiency Level on Features of Written Discourse Generated by Iranian EFL Learners

In recent years, a number of large-scale writing assessments (e.g., TOEFL iBT) have employed integrated writing tests to measure test takers’ academic writing ability. Using a quantitative method, the current study examined how written textual features and use of source material(s) varied across two types of text-based integrated writing tasks (i.e., listening-to-write vs. reading-to-write) and...

Publication date: 2011